Chapter 1: Hi random referee

This is my course page (Juha/monnis). Here is the link to my GitHub repository.

This link leads to the R Markdown cheat sheet. Should be useful and stuff.


Chapter 2: Regression and model validation

Part 0: Libraries used and overall setup

Disclaimer: this first part looks a bit awful, but bear with me.

rm(list=ls())

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa

Part 1: Structure and dimensions of the data

A sample of students giving answers to questions related to how they learn stuff, plus some background characteristics (e.g. age, gender). There is a grand total of 166 observations and 7 variables: one factor variable expressing gender, three integer variables expressing age, attitude and points for an exam, and three numeric variables expressing indexes that are based on questions related to their methods of learning. More precisely, whether their learning exhibits signs of so-called strategic learning, deep learning and/or surface learning.

setwd("~/GitHub/IODS-project/data")

students2014 <- read.table("learning2014.txt")
str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(students2014)
## [1] 166   7

Part 2: Overview of the data

The students appear to be mostly female (110 as opposed to 56 males) and their median age is 22. Their point distribution leans towards the higher end, which indicates that they are a quite capable lot. Their attitude seems to be the most important factor in determining the points, with a whopping correlation of 0.437. Out of the learning indexes, the elements of strategic learning correlate positively with the points received, while surface learning's correlation is of the same magnitude but with a negative sign. The deep learning index matters the least, with close to zero correlation. Thus, it seems that the points earned favour strategic learning.

summary(students2014)
##  gender       Age           Attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :14.00   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :32.00   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :31.43   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :50.00   Max.   :4.917   Max.   :5.000  
##       surf           Points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00
ggpairs(students2014, lower = list(combo = wrap("facethist", bins = 20)))
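
The same correlations can also be picked out numerically; a minimal sketch, assuming the students2014 data frame loaded above:

# correlations of the numeric variables with exam points (gender excluded)
cor(students2014[, -1])["Points", ]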

Part 3: Linear regression model

My chosen variables are attitude, strategic learning index (stra) and surface learning index (surf).

Out of the components, only attitude is statistically significant (***), which means that we can REJECT the null hypothesis (the null hypothesis presumes the chosen regressor's effect to be zero to begin with, i.e. that it doesn't have any effect on the dependent variable). However, we cannot reject the null hypotheses regarding the chosen learning indexes. Perhaps determining their role would require more observations or more variation within the data, or perhaps they simply don't have any predictive power for the chosen dependent variable. For now, we can just say that with the given data, we cannot observe a statistically significant relationship between points and the learning indexes.

As for the significant regressor, the results imply that one point of the attitude metric predicts roughly 0.34 points in the exam. The intercept is ~11, meaning that this is the model’s baseline points for everyone - points that even a person with an attitude score of zero can get. However, this is a hypothetical case, as the minimum of attitude points in the sample is 14 points.

Linear_regression <- lm(Points ~ Attitude + stra + surf, data = students2014)

summary(Linear_regression)
## 
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## Attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
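
To make the interpretation concrete, the fitted model can be used to predict points for a hypothetical student; a small sketch where the values Attitude = 30, stra = 3 and surf = 3 are purely illustrative, not from the data:

# predicted exam points for a hypothetical student (illustrative values only)
new_student <- data.frame(Attitude = 30, stra = 3, surf = 3)
predict(Linear_regression, newdata = new_student)

With the coefficients above, this should give roughly 22 points.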

Part 4: Variation of the model

Removing the nonsignificant variables.

For the model parameters, the results imply that one point of the attitude metric predicts roughly 0.35 points in the exam. The intercept is ~11.6, meaning that this is the model’s baseline points for everyone - points that even a person with an attitude score of zero can get. However, this is a hypothetical case, as the minimum of attitude points in the sample is 14 points. Removing the nonsignificant independent variables does not change the results much.

The R squared takes the value 0.1906. This is the percentage of the variation that the linear model explains out of the total variation when compared to a model with no independent variables. In other words, we compare the residuals of a "baseline model" to the residuals of our fitted model. The baseline model is a prediction-wise useless model which always predicts the mean of the observed dependent variable and does not use the independent variables at all. So basically, R squared tells how much more of the total variance the chosen independent variables explain when compared to the baseline model. In an optimal scenario, an R squared of one would mean that the data points overlap the regression line perfectly.

Ultimately, it must be noted that the R squared alone is not enough to indicate whether the model is bad (or good) - for instance a low R squared can be totally understandable, considering the complexity of the statistical relationship we are trying to model. Is this a big R squared? Well, to the best of my knowledge, in social sciences a low R squared can be justified, and this is not even that low. It all depends on the context: modelling the weather or macroeconomic fluctuations or the role of certain genes explaining the onset of psychosis are bound to have varying coefficients of determination.

Linear_regression2 <- lm(Points ~ Attitude, data = students2014)

summary(Linear_regression2)
## 
## Call:
## lm(formula = Points ~ Attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
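
The R squared described above can also be verified by hand from the residual sums of squares; a quick sketch:

# R squared = 1 - SS_residual / SS_total, where the baseline model predicts the mean
ss_res <- sum(residuals(Linear_regression2)^2)
ss_tot <- sum((students2014$Points - mean(students2014$Points))^2)
1 - ss_res / ss_tot  # should match the Multiple R-squared of about 0.1906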

Part 5: Diagnostics

Naturally, there are more or less crucial assumptions behind the linear regression model.

  1. The relationship modelled is linear.
  2. The noise, or the error term, is normally distributed (mean of zero and constant variance).
  3. The errors are not correlated, which means that there is no omitted variable bias or other sources of endogeneity present, such as reverse causality or systematic measurement errors.

Analyzing the residuals of the model (the differences between the fitted values and the actual observations) gives insight into whether these assumptions are violated.

The QQ-plot explores the normality assumption. In this case, normality is reasonably satisfied, with only some deviation from the line at both ends.

The residuals versus fitted values plot explores the constant variance assumption. As the name states, we try to find any patterns across the plotted values. Nothing alarming here either - the points are spread out somewhat randomly.

Finally, the leverage plot can help identify outliers’ impact on the model parameters. Outliers are problematic as they often don’t represent the sample, and can make the regression slope deviate from its “more true” course. Here, no outliers really stand out, which is great!

par(mfrow = c(2,2))

plot(Linear_regression2, c(1,2,5))


Chapter 3: Logistic regression

Part 0: Libraries used and overall setup

Disclaimer: this first part looks a bit awful, but bear with me.

rm(list=ls())

library(dplyr); library(ggplot2); library(GGally); library(readr); library(tidyr)

Part 2: The data

This data contains information about Portuguese students and their endeavours in life. We use it to analyze the subtle relationship between alcohol consumption and school performance. The variables are listed below.

setwd("~/GitHub/IODS-project/data")

alc <- read.table("alc.txt")

colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

Part 3: Choosing the variables

My variables of choice are:

absences - Which comes first: absences or alcohol use? Causality is a bit questionable, but one way to hypothesize this is that more hangovers (and as such more absences) go hand in hand with more alcohol use. There is a likely positive correlation.

romantic - Most persons in romantic relationships tend to become ex-drinking buddies. This is an observation in life, not a hypothesis :D

famsize - Family size might be an unintuitive choice at first glimpse, but should one have many siblings, some of them might be older and thus able to purchase alcohol for their younger siblings. Personal experience tells me that this is one of the main supply channels of alcohol for young people.

studytime - Time spent studying is time away from drinking - as time spent drinking is time away from studying. Causality is again a bit hard to pin down, but there will likely be a negative correlation.

Part 4: Exploring the chosen variables

The graphs indicate that over two thirds of the students drink moderately. Two thirds are not romantically involved. Two thirds of the sample have more than three family members, and their weekly study time rarely exceeds 10 hours.

Regarding the cross-tabulations, I grouped moderate (including teetotallers) and heavy users i) according to their family size and ii) according to their romantic status.

It seems that children in bigger families do not drink as heavily: it is more common for the children in the smaller families to consume high amounts of alcohol. On average, the heavy users from bigger families are absent less than heavy users from smaller families. The heavy users' average study time is smaller than that of the students who drink moderately, but there seems to be no big (or interesting) difference regarding family size. My hypothesis regarding family size seems to be highly questionable.

Being in a romantic relationship seems to be associated with lower levels of alcohol consumption if looking at the case counts, which makes sense regarding my hypothesis. The ratio of heavy to moderate drinkers is 45% among the non-relationship students as opposed to 37.5% among the couples-people.

For persons in a romantic relationship, there seem to be more absences for both the moderate and heavy drinkers when compared to non-relationship persons (probably spending time together, awww). If romantic status is combined with high levels of alcohol use, the mean absences get even bigger (maybe they are drinking together more often?).

It must be noted that heavy drinkers might not be that great significant-other material: they might be more interested in partying than dating, so there might be selection problems and numerous other mechanisms/channels that I haven't even thought of.

chosenvars <- c("absences", "romantic", "famsize", "studytime", "high_use", "sex")
alc2 <- select(alc, one_of(chosenvars))
str(alc2)
## 'data.frame':    382 obs. of  6 variables:
##  $ absences : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famsize  : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ studytime: int  2 2 2 3 2 2 2 2 2 2 ...
##  $ high_use : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
##  $ sex      : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
gather(alc2) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped

alc %>% group_by(famsize, high_use) %>% summarise(count = n(), mean_absences = mean(absences), mean_studytime = mean(studytime))
## # A tibble: 4 x 5
## # Groups:   famsize [?]
##   famsize high_use count mean_absences mean_studytime
##   <fct>   <lgl>    <int>         <dbl>          <dbl>
## 1 GT3     FALSE      201          3.75           2.17
## 2 GT3     TRUE        77          6.05           1.79
## 3 LE3     FALSE       67          3.58           2.07
## 4 LE3     TRUE        37          7.03           1.73
alc %>% group_by(romantic, high_use) %>% summarise(count = n(), mean_absences = mean(absences), mean_studytime = mean(studytime))
## # A tibble: 4 x 5
## # Groups:   romantic [?]
##   romantic high_use count mean_absences mean_studytime
##   <fct>    <lgl>    <int>         <dbl>          <dbl>
## 1 no       FALSE      180          3.41           2.16
## 2 no       TRUE        81          5.78           1.69
## 3 yes      FALSE       88          4.32           2.12
## 4 yes      TRUE        33          7.82           1.97

Part 5: Logistic regression to the rescue

It must be noted that intuition tells me the regression equation is “a bit dirty”. Study time and absences are probably correlated as heck, and combining an alcohol use indicator with absences and study time, well, it sounds like an unholy triangle regarding statistical inference.

Interpretation of the summary and the odds ratios

From my variables, study time and absences are statistically significant. More time spent studying indicates less heavy boozing and more absences probably indicates recovering from this heavy boozing (or something else, this is just a hypothesis).

The coefficient for study time is -0.53 (OR: 0.59) and the coefficient for absences 0.08 (OR: 1.09). The meaning of the raw coefficients is not that intuitive. The signs [+,-] and the magnitudes of the coefficients are somewhat informative, but by exponentiating them we turn them into odds ratios.

For example, the odds ratio of 0.59 for study time indicates that a jump into a higher studying category (this is a categorical variable taking the values 1-4) decreases the odds of drinking heavily. Vice versa, a jump to a category of students who study less makes the odds of drinking heavily 1/0.59, i.e. about 1.69 times, bigger. For absences, one extra absence makes the odds of being a heavy drinker about 1.09 times bigger.
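
To illustrate the step from coefficients to odds ratios, the exponentiation can be done by hand; a minimal sketch using the coefficient values quoted above:

# from log-odds coefficients to odds ratios
exp(-0.53031)      # studytime: roughly 0.59
exp(0.08236)       # absences: roughly 1.09
1 / exp(-0.53031)  # the studytime effect in the other direction: roughly 1.7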

m <- glm(high_use ~ absences + romantic + famsize + studytime, data = alc, family = "binomial")

OR <- coef(m) %>% exp

CI <- confint(m) %>% exp
## Waiting for profiling to be done...
summary(m)
## 
## Call:
## glm(formula = high_use ~ absences + romantic + famsize + studytime, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1302  -0.8348  -0.6421   1.1824   2.2477  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.22264    0.35314  -0.630 0.528393    
## absences     0.08236    0.02300   3.581 0.000342 ***
## romanticyes -0.26346    0.25913  -1.017 0.309299    
## famsizeLE3   0.31521    0.25692   1.227 0.219867    
## studytime   -0.53031    0.15542  -3.412 0.000645 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 431.24  on 377  degrees of freedom
## AIC: 441.24
## 
## Number of Fisher Scoring iterations: 4
cbind(OR, CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 0.8004009 0.3996898 1.6005996
## absences    1.0858500 1.0402630 1.1385241
## romanticyes 0.7683888 0.4578880 1.2679157
## famsizeLE3  1.3705508 0.8239609 2.2609734
## studytime   0.5884228 0.4296281 0.7912992

Part 6: Exploring the predictive power

Well, it seems that my model is not that convincing.

probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)

alc <- mutate(alc, prediction = probability>0.5)
## Warning: package 'bindrcpp' was built under R version 3.5.1
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   257   11
##    TRUE     97   17
g <- ggplot(alc, aes(x = probability, y = high_use, col=prediction))

g <- g +geom_point()
g
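
One way to put a number on "not that convincing" is the training error, i.e. the share of wrong predictions; a small sketch using the objects created above:

# proportion of misclassified students (training error)
mean(alc$high_use != alc$prediction)

From the cross-tabulation above this is (11 + 97) / 382, or roughly 0.28.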


Chapter 4: Clustering and classification

Part 1: Libraries used and overall setup

rm(list=ls())

library(dplyr); library(ggplot2); library(GGally); library(readr); library(tidyr); library(MASS); library(corrplot)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## corrplot 0.84 loaded

Part 2: The Boston housing dataset

The dataset contains housing values in the suburbs of Boston along with aggregated data about median housing prices, crime levels per capita and other possibly relevant things. The data is aggregated at the suburb level and in total there are 506 observations (suburbs) and 14 variables, all of which are either numeric or integer.

In order to analyze the median housing prices, the so-called hedonic pricing method can be used. Hedonic pricing is a highly influential idea developed by the late economist Sherwin Rosen in the 70s: the price of a good can be derived from its characteristics, but of course it can be turned the other way around as well, to analyze crime rates for instance, as we are doing here.

Link to the data and its source information: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html

data("Boston")

dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Part 3: The variables in the data

Below, I have summaries and distributions of the variables as well as a correlation matrix. I comment mostly on the things that appear interesting to me. The proportion of houses built before 1940 is concentrated at high values, so there are few entirely new suburbs - well, Boston is an old city. Most of the suburbs are somewhat close to the employment centers of Boston, which makes sense. For the per capita crime levels, there seems to be an absurd level of crime in one suburb, but most of them are relatively crime-free. The percentage of lower status population is concentrated at the lower end, which makes sense, as Boston has always had that upper-middle-class sound to it. Lastly, the median (owner-occupied) home value is mostly below 30k USD, but there is a noticeable spike at the 50k USD mark.

As for the correlations, crime correlates negatively with housing prices and positively with accessibility to radial highways, the property tax rate and the percentage of lower status population. The median price correlates quite negatively with the percentage of lower status population, which is intuitive. Proximity to the Charles River does not correlate much with, well, anything. This is somewhat surprising, as I would have imagined that it would be correlated with housing prices and ages at least.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
gather(Boston)  %>% ggplot(aes(value)) + facet_wrap(~key, scales="free") + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

cor_matrix <- cor(Boston) %>% round(2)
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
corrplot(cor_matrix, method="circle", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

Part 4: Standardizing the variables and other stuff

Standardizing variables or an entire dataset can be done for numerous reasons, for instance for the sake of easier comparability. After the scaling, the variables' distributions look exactly the same, but they are rescaled to have a mean of zero and a standard deviation of one.
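
In practice, the scaling subtracts the column mean and divides by the column standard deviation; a minimal sketch for a single variable:

# manual standardization of one column, equivalent to scale(Boston)[, "crim"]
crim_std <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
summary(crim_std)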

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
boston_scaled <- as.data.frame(boston_scaled)
gather(boston_scaled)  %>% ggplot(aes(value)) + facet_wrap(~key, scales="free") + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
label <- c("low","med_low","med_high","high")
crime <- cut(boston_scaled$crim, breaks = bins, labels = label, include.lowest = TRUE)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)


n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)

Part 5: Linear discriminant analysis

What we can see from the LDA plot is that accessibility to radial highways (rad) is an important feature contributing to LDA1, and it is not closely correlated with the other features. The length of its arrow tells us that it is a major discriminator in determining the high crime suburbs. LDA2 seems to separate the other crime level suburbs.

lda.fit <- lda(crime ~ ., data = train)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)

plot(lda.fit, dimen = 2, col = classes, pch = classes )
lda.arrows(lda.fit, myscale = 2)

Part 6: Predicting with the model

Our model predicts suburbs belonging to the high crime class extremely well, but the lower crime classes leave a bit to be desired. Especially med_low tends to get classified as med_high. But overall, not too bad!

The reason why high crime suburbs are classified so well by the model can be seen from the LDA plot: accessibility to radial highways seems to be a huge and distinctive factor in determining these suburbs. “Radan varrella sattuu ja tapahtuu” (roughly: “things happen along the tracks”), some Finns might say.

Music recommendation: “Kuustoista kilsaa Kontulaan”

lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       14       5        2    0
##   med_low    7      12        9    0
##   med_high   0       7       16    2
##   high       0       0        0   28
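
The overall share of correct predictions can be read off the diagonal of the cross-tabulation; a short sketch:

# proportion of test set suburbs classified into the correct crime category
mean(correct_classes == lda.pred$class)

From the table above this is (14 + 12 + 16 + 28) / 102, roughly 0.69.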

Part 7: Cluster analysis

The optimal number of clusters seems to be 2, judging from where the total within-cluster sum of squares stops dropping sharply. The model suggests that there are two kinds of suburbs in the city of Boston.

A glimpse at the plots reveals that the black cluster is characterized by more crime, more pollution, being closer to radial highways and being further away from employment centers. The red cluster is in many ways the opposite. Or that's what my intuition tells me!

Since cluster analysis usually relies on the imagination of its user and Boston is a reputable city of the Irish, I'm going to name these two kinds of suburbs "Carrot Top Hoods" (black cluster) and "True Kelt Dwellings" (red cluster).

data("Boston")
boston_scaled2 <- scale(Boston)
boston_scaled2 <- as.data.frame(boston_scaled2)

dist_eu <- dist(boston_scaled2)

summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# k-means clustering
km <- kmeans(boston_scaled2, centers = 3)

set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')

km <- kmeans(boston_scaled2, centers = 2)

pairs(boston_scaled2[6:10], col = km$cluster)

pairs(boston_scaled2[1:5], col = km$cluster)


Chapter 5: Dimensionality reduction techniques

Part 0: Libraries used and overall setup

rm(list=ls())

library(dplyr); library(ggplot2); library(GGally); library(readr); library(tidyr); library(corrplot); library(FactoMineR)

Part 1: Making sense of the data

I'll start by exploring the ratios of labor force participation and secondary education attainment between genders. Regarding the secondary education ratio, there are some countries where there are more females with secondary education than males, but overall it seems that the ratio usually lies somewhere between 80% and 100% (as compared to males; 1 would mean an equal number of persons with secondary education). For the labor force participation rate, the majority of countries lie between 0.5 and 1, but the distribution of ratios is more uneven than that of the secondary education ratio and there are fewer cases where there are more females than males. For other participatory rates, females are less likely to be members of parliament, with the mean percentage across countries being just 20.91.

As for the other variables, maternal mortality has some drastic outliers, as has life expectancy. A likely explanation for high maternal mortality and low life expectancy is poverty (and warfare, but we cannot identify it from the data), as there are strong negative correlations between GNI and maternal mortality and adolescent birth rates.
For other correlations, life expectancy correlates strongly (+) with expected education years, and life expectancy also has strong negative correlations with the maternal mortality rate and the adolescent birth rate, which makes sense, as maternal mortality and adolescent birth rate correspondingly correlate strongly (+) with each other.

setwd("~/GitHub/IODS-project/data")

human <- read.table("human.txt")
summary(human)
##     ed2r_FM           lfr_FM          Life_exp        Educ_exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI           Mat_mortr       Adol_birthr      Percent_MP_F  
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
gather(human)  %>% ggplot(aes(value)) + facet_wrap(~key, scales="free") + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

M<-cor(human) %>% round(2)
M
##              ed2r_FM lfr_FM Life_exp Educ_exp   GNI Mat_mortr Adol_birthr
## ed2r_FM         1.00   0.01     0.58     0.59  0.43     -0.66       -0.53
## lfr_FM          0.01   1.00    -0.14     0.05 -0.02      0.24        0.12
## Life_exp        0.58  -0.14     1.00     0.79  0.63     -0.86       -0.73
## Educ_exp        0.59   0.05     0.79     1.00  0.62     -0.74       -0.70
## GNI             0.43  -0.02     0.63     0.62  1.00     -0.50       -0.56
## Mat_mortr      -0.66   0.24    -0.86    -0.74 -0.50      1.00        0.76
## Adol_birthr    -0.53   0.12    -0.73    -0.70 -0.56      0.76        1.00
## Percent_MP_F    0.08   0.25     0.17     0.21  0.09     -0.09       -0.07
##              Percent_MP_F
## ed2r_FM              0.08
## lfr_FM               0.25
## Life_exp             0.17
## Educ_exp             0.21
## GNI                  0.09
## Mat_mortr           -0.09
## Adol_birthr         -0.07
## Percent_MP_F         1.00
corrplot(M, type = "upper", method="square", cl.pos = "b", tl.pos = "d", tl.cex = 0.5)

Part 2: PCA with non-standardized variables

[Comparison of non-standardized and standardized PCAs under Part 3.]

pca_human <- prcomp(human)

s <- summary(pca_human)

pca_pr <- round(100*s$importance[2, ], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

biplot(pca_human, cex = c(0.5, 0.7), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

Part 3: Comparison between PCAs: non-standardized vs standardized variables

PCA does not work that well if the variables are not standardized. This is due to the variable scaling: especially GNI varies on a huge scale compared to the other variables, which makes the analysis futile. As such, there's not much sense in trying to decipher these results, besides understanding that the analysis is ruined by one rampantly varying variable.

My face whilst trying to decipher the results

After we have standardized the variables, the results are very different and the biplot and the components seem meaningful.

A word about PCA, if I may. If we have a ton of variables, many of them can measure related attributes and as such are “redundant” in the analysis. In other words, they can measure the same phenomena and as such the same underlying things might affect them. For instance, intuition tells us that the female to male labor force participation rate and the percentage of female parliament members are closely correlated outcomes determined by similar things in society (religion, empowerment of females etc.) and as such, in general, they measure the same phenomenon. In general that is; of course there might be drastic exceptions.

Based on the original variables, PCA constructs new variables that capture the maximum amount of variance in the original data and can be used to reduce the number of dimensions in statistical analysis. To be precise, the first component captures the maximum amount of variance, the second component captures the maximum amount of the variance left after the first one, and so on. The principal components are uncorrelated, which is useful, since the original data can be a real cluster of heavily correlated variables. In this case, the two principal components depicted in the biplot capture 69.8% of the variance between countries.

We can analyze the first two components in the following manner with the biplot below.

human_std <- scale(human)
summary(human_std)
##     ed2r_FM            lfr_FM           Life_exp          Educ_exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7188   Min.   :-2.7378  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6425   1st Qu.:-0.6782  
##  Median : 0.3503   Median : 0.2316   Median : 0.3056   Median : 0.1140  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.6717   3rd Qu.: 0.7126  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 1.4218   Max.   : 2.4730  
##       GNI            Mat_mortr        Adol_birthr       Percent_MP_F    
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
pca_humanstd <- prcomp(human_std)

z <- summary(pca_humanstd)

pca_prz <- round(100*z$importance[2, ], digits = 1)

# print out the percentages of variance
pca_prz
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
# create object pc_lab to be used as axis labels
pc_labz <- paste0(names(pca_prz), " (", pca_prz, "%)")

# draw a biplot
biplot(pca_humanstd, cex = c(0.5, 0.7), col = c("grey40", "deeppink2"), xlab = pc_labz[1], ylab = pc_labz[2])
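
As a sanity check, the same variance percentages can be computed directly from the component standard deviations; a quick sketch:

# proportion of total variance captured by each principal component
pca_var <- pca_humanstd$sdev^2
round(100 * pca_var / sum(pca_var), 1)  # matches pca_prz above
sum(pca_prz[1:2])                       # the first two components together: ~69.8%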

Part 4: Personal interpretation of the biplot components

In this case, the two principal components depicted in the biplot above capture 69.8% of the variance between countries.

The components and their main variables (as taken from the biplot arrows’ directions):

Female-male labour force participation and the percentage of females as members of parliament contribute most to PC2. I name this the FEMALE PARTICIPATION PRINCIPAL COMPONENT. Female participation in politics and in labor markets seems to be a differentiating factor in analyzing the countries.

Expected education years, GNI, life expectancy and the female-male secondary educational attainment rate (all of them positively correlated (+) with PC1), as well as the negatively correlated maternal mortality and adolescent birth rates, contribute most to PC1. I name this the OVERALL DEVELOPMENT PRINCIPAL COMPONENT.

So basically, we have the overall development and female participation components and they capture almost 70% of the variance between countries.
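
The reading of the arrows can be backed up by looking at the loadings of the original variables on the first two components; a small sketch:

# loadings (rotation matrix) of the original variables on PC1 and PC2
round(pca_humanstd$rotation[, 1:2], 2)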

Part 5: Multiple correspondence analysis

Since there's a load of variables, 36 to be exact, with a total of 300 observations, I'm going to limit the number of variables used in the analysis, as was done in the DataCamp exercises. All but one of the variables are factors with at least 2 levels. The only integer variable is age. So basically, we have some tea drinkers here and we are trying to summarize what drives the differences in their tea drinking.

Sooo, MCA is sort of like PCA, except that it is used for factor variables. The components explain variance in the same manner as with PCA. In the biplot, factor levels with a similar profile are in close proximity to each other. What I find interesting are the following combinations (and my interpretations):

*Teatime + pub + tea bag & unpackaged - Sounds like a highly British combination. Tea, teatime and the local village pub.

*Earl Grey + spirituality - If the nationality of the people that the sample covers is British, then this is an understandable combination! But for some reason, I would have anticipated green tea to be closer to spirituality. Maybe I don’t understand that much about tea drinking.

*Green tea + not tea time - Well sort of makes sense as green tea drinkers might not be the traditional British tea drinkers. Maybe I don’t understand anything about tea drinking.

By the way, MCA reminds me of Beastie Boys.

Music recommendation: “Sabotage”

data("tea")

str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255  
##  Not.feminine:171   sophisticated    :215   slimming   : 45  
##                                                              
##                                                              
##                                                              
##                                                              
##                                                              
##         exciting          relaxing              effect.on.health
##  exciting   :116   No.relaxing:113   effect on health   : 66    
##  No.exciting:184   relaxing   :187   No.effect on health:234    
##                                                                 
##                                                                 
##                                                                 
##                                                                 
## 
tea_time <- tea[,c("Tea", "how", "spirituality", "tearoom", "pub", "tea.time")]

mca <- MCA(tea_time, graph = FALSE)

summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.258   0.214   0.183   0.165   0.144   0.134
## % of var.             19.364  16.050  13.696  12.401  10.789  10.054
## Cumulative % of var.  19.364  35.414  49.110  61.511  72.301  82.354
##                        Dim.7   Dim.8
## Variance               0.124   0.112
## % of var.              9.271   8.375
## Cumulative % of var.  91.625 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.527  0.359  0.275 |  0.502  0.392  0.249 | -0.323
## 2                  | -0.527  0.359  0.275 |  0.502  0.392  0.249 | -0.323
## 3                  | -0.175  0.040  0.060 | -0.061  0.006  0.007 | -0.464
## 4                  | -0.515  0.342  0.299 | -0.558  0.485  0.352 |  0.123
## 5                  | -0.515  0.342  0.299 | -0.558  0.485  0.352 |  0.123
## 6                  | -0.596  0.459  0.598 | -0.130  0.026  0.028 | -0.245
## 7                  | -0.175  0.040  0.060 | -0.061  0.006  0.007 | -0.464
## 8                  | -0.106  0.015  0.012 |  0.571  0.508  0.352 | -0.542
## 9                  |  0.204  0.054  0.056 | -0.032  0.002  0.001 | -0.071
## 10                 |  0.454  0.266  0.108 |  0.617  0.593  0.200 |  0.278
##                       ctr   cos2  
## 1                   0.190  0.103 |
## 2                   0.190  0.103 |
## 3                   0.394  0.423 |
## 4                   0.027  0.017 |
## 5                   0.027  0.017 |
## 6                   0.110  0.101 |
## 7                   0.394  0.423 |
## 8                   0.536  0.317 |
## 9                   0.009  0.007 |
## 10                  0.141  0.040 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.255   1.039   0.021   2.527 |   1.204  27.840
## Earl Grey          |   0.045   0.083   0.004   1.038 |  -0.550  15.145
## green              |  -0.834   4.940   0.086  -5.071 |   0.516   2.281
## tea bag            |  -0.465   7.898   0.282  -9.188 |  -0.214   2.019
## tea bag+unpackaged |   0.693   9.707   0.219   8.092 |  -0.133   0.434
## unpackaged         |   0.385   1.150   0.020   2.461 |   1.358  17.245
## Not.spirituality   |  -0.078   0.271   0.013  -2.001 |   0.372   7.400
## spirituality       |   0.171   0.594   0.013   2.001 |  -0.815  16.217
## Not.tearoom        |  -0.355   6.554   0.525 -12.531 |  -0.046   0.135
## tearoom            |   1.480  27.348   0.525  12.531 |   0.193   0.562
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.475  11.911 |  -0.392   3.460   0.050  -3.879 |
## Earl Grey            0.545 -12.768 |  -0.193   2.188   0.067  -4.483 |
## green                0.033   3.137 |   2.008  40.482   0.498  12.207 |
## tea bag              0.060  -4.230 |  -0.438   9.920   0.251  -8.660 |
## tea bag+unpackaged   0.008  -1.558 |   0.571   9.318   0.149   6.668 |
## unpackaged           0.252   8.674 |   0.578   3.655   0.046   3.689 |
## Not.spirituality     0.303   9.522 |  -0.295   5.469   0.191  -7.562 |
## spirituality         0.303  -9.522 |   0.647  11.986   0.191   7.562 |
## Not.tearoom          0.009  -1.635 |  -0.103   0.775   0.044  -3.625 |
## tearoom              0.009   1.635 |   0.428   3.235   0.044   3.625 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.094 0.581 0.505 |
## how                | 0.291 0.253 0.251 |
## spirituality       | 0.013 0.303 0.191 |
## tearoom            | 0.525 0.009 0.044 |
## pub                | 0.220 0.129 0.026 |
## tea.time           | 0.406 0.009 0.078 |
plot(mca, invisible=c("ind"), habillage = "quali")


Chapter 6: Analysis of longitudinal data

Part 0: Libraries used and overall setup

rm(list=ls())

library(dplyr); library(ggplot2); library(GGally); library(readr); library(tidyr); library(lme4) 
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
setwd("~/GitHub/IODS-project/data")

ratsl <- read.table("RATSL.txt")
bprsl <- read.table("BPRSL.txt")

ratsl$ID <- factor(ratsl$ID)
ratsl$Group <- factor(ratsl$Group)

I: Implementing the analyses of chapter 8 using the RATS data

I am quite slavishly following the chapters in the book. Like a rat following the pied piper.

The data used in this first part:

*Rats - Three different groups of rats (a total of 16 rats) are fed different diets and the change in their weight is observed. The data is originally from Crowder & Hand (1990). Used in this first part.

First, I plot the individual rat observations by time. Group signals the colour.

[Important notice: In chapter 8 of Vehkalahti & Everitt (2018?), the layout was three graphs, one graph per group, but I chose to plot all the rats in the same graph and just distinguish the groups by color, as there are only 16 rats. The information provided by both approaches is roughly the same.]

Whoa, there is one huge rat in the second group! Otherwise, the initial levels of the rat weights and the slopes seem to be quite close together group-wise.

The biggest rodent in the world is the capybara.

Capybaras are wicked cool animals.

Respected by all.

What’s not to love.

ggplot(ratsl, aes(x = Time, y = Weight, group = ID, col = Group)) +
  geom_line() +
  scale_y_continuous(name = "rat weight", limits = c(min(ratsl$Weight), max(ratsl$Weight)))
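
For comparison, here is a rough sketch of the book-style layout mentioned above, one panel per group (my own alternative rendering, not part of the original assignment output):

# Alternative layout: one panel per Group, as in the book chapter
ggplot(ratsl, aes(x = Time, y = Weight, group = ID)) +
  geom_line() +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "rat weight")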

Now if we standardize the weights (below), the slopes become more similar. There seem to be no clear differences in the growth rates of the rats between the groups over time.

ratsl <- ratsl %>%
  group_by(Time) %>%
  mutate(stdweight = scale(Weight)) %>%
  ungroup()

ggplot(ratsl, aes(x = Time, y = stdweight, group = ID, col = Group)) +
  geom_line() +
  scale_y_continuous(name = "Standardized rat weight")
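
As a side note, the scale() call above standardizes Weight within each time point, i.e. (Weight - mean) / sd per Time. A manual equivalent (my own sketch, not run in the original analysis) would be:

# Manual equivalent of scale(): centre and scale Weight within each Time point
ratsl_check <- ratsl %>%
  group_by(Time) %>%
  mutate(stdweight2 = (Weight - mean(Weight)) / sd(Weight)) %>%
  ungroup()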

Next, I plot how the mean weight has changed by group.

Not much new information here; as anticipated, the standard errors are largest in the second group. Yup, the one with that huge rat.

# number of measurement occasions (used as n for the standard error below)
n <- ratsl$Time %>% unique() %>% length()

# group-wise mean weight and its standard error at each time point
ratsass <- ratsl %>%
  group_by(Group, Time) %>%
  summarise( mean = mean(Weight), se = sd(Weight)/sqrt(n)) %>%
  ungroup()

ggplot(ratsass, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  geom_point(size=3) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se), linetype = 1, width = 0.3) +
  theme(legend.position = c(0.9,0.5)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")

Summary measures approach in the form of boxplots: one box per group, based on each rat's mean weight over the follow-up period (excluding the baseline measurement) and excluding that one huge rat. The groups differ drastically.

ratsass2 <- ratsl %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(Weight) ) %>%
  ungroup()

# Glimpse the data
glimpse(ratsass2)
## Observations: 16
## Variables: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, ...
str(ratsass2)
## Classes 'tbl_df', 'tbl' and 'data.frame':    16 obs. of  3 variables:
##  $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ ID   : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ mean : num  263 239 262 267 271 ...
# drop the outlier (that one huge rat)
ratsass3 <- filter(ratsass2, mean < 550)

str(ratsass3)
## Classes 'tbl_df', 'tbl' and 'data.frame':    15 obs. of  3 variables:
##  $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ ID   : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ mean : num  263 239 262 267 271 ...
ggplot(ratsass3, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), exluding initial Weight")

Next we check whether these differences appear to be statistically significant. As there are three groups, ANOVA is used instead of a two-sample t-test. The null hypothesis is that the group means are the same; the response is the average weight over the follow-up period (t > WD1), the baseline weight (t = WD1) is included as a covariate, and we check the significance of both terms.

The assumptions are that the observations are independent of each other, the data within each group are (approximately) normally distributed, and the groups share a common variance. A quick sketch for checking these is given below.
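These checks were not part of the original analysis, but with base R they could be sketched roughly like this (with only a handful of rats per group, the tests have very little power):

# Rough checks of the ANOVA assumptions (small samples, so interpret with care)
shapiro.test(ratsass2$mean[ratsass2$Group == 1])   # normality within group 1
bartlett.test(mean ~ Group, data = ratsass2)       # equal variances across groups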

According to the ANOVA, the null hypothesis can only be rejected for the baseline weight; once the baseline is accounted for, the Group effect is not significant at the 5 % level. Otherwise, whatever is fed to those poor rats is not doing much, which might be a good thing.

# wide-form RATS data, needed for the day-1 baseline weight (WD1)
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt")

# add the baseline weight as a covariate and fit the model
ratsass4 <- ratsass2 %>% mutate(baseline = RATS$WD1)

fit <- lm(mean ~ baseline + Group, data = ratsass4)

anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value   Pr(>F)    
## baseline   1 253625  253625 1859.8201 1.57e-14 ***
## Group      2    879     439    3.2219  0.07586 .  
## Residuals 12   1636     136                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

II: Implementing the analyses of chapter 9 using the BPRS data.

The data used in this second part:

*Bprs - 40 males are divided into treatment and control groups (20 males in each), and their brief psychiatric rating scale (BPRS) scores are measured weekly for a total of 8 weeks. BPRS is a rating scale which measures different psychiatric symptoms (Wikipedia). The data is originally from Davis (2002).

A few words about multilevel modeling, if I may. A basic way to distinguish between different multilevel models is to categorize them into i) random intercept models, ii) random slope models and iii) a combination of the two, random intercept + slope models. Personally, I prefer the naming convention of "varying intercept" and "varying slope" models, as it is more intuitive, but that might just be me.

What does random or varying mean in this context? Traditionally with these models, we can take into account that the data structure is nested (to my understanding, that is). One classic example of a nested data structure is GPAs of students from different classes and schools. A usual assumption of a regression model is that the observations are independent, but this might not be a valid assumption. With multilevel models we can allow heterogeneity in the intercepts and/or slopes between groups or individuals. For instance, the different classes and schools might affect the outcome variable of the students in some way (their outcomes are correlated within schools or classes), and with multilevel models we can capture this kind of effect and then analyze what impact the different nested structures have on the intercepts or slopes in the analysis. A small toy example of the syntax is sketched below.
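To make the nesting idea concrete, here is a tiny made-up example of the lme4 syntax for the school/class case (the data frame and variable names are invented purely for illustration; this is not part of the course data, and lme4 is already loaded above):

# Toy example: pupils nested in classes nested in schools (fabricated data)
set.seed(1)
toy <- data.frame(
  school = factor(rep(1:4, each = 20)),
  class  = factor(rep(1:8, each = 10)),
  gpa    = rnorm(80, mean = 3, sd = 0.5)
)
# random intercept for school, and for class within school
toy_model <- lmer(gpa ~ 1 + (1 | school/class), data = toy)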

It must be noted that here we use multilevel models to allow the slopes and intercepts to differ by individual subject. The main point is that a traditional regression assumes that the observations are independent, which is a rather invalid assumption for repeated measures like these: a subject's BPRS score at week 0 and at week 1 are unlikely to be independent observations. As such, we take the intraclass correlation between an individual subject's measurements at different time points into account, making this a more robust form of analysis.

On with the show! First I plot the bprs scores for the treatment and the control groups. It's quite a mess, to be honest. When plotting the individual time series, the use of two separate panels for the control and treatment groups is justified, even though it's still kinda messy.

bprsl$treatment <- factor(bprsl$treatment)


ggplot(bprsl, aes(x = week, y = bprs, linetype = as.factor(subject))) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(bprsl$bprs), max(bprsl$bprs)))

Then a standard issue linear regression is performed. The treatment is not statistically significant, while both groups have a declining trend over time. It must be noted that this regression suffers from the issues mentioned above and ignores the repeated measures structure of the data.

bprsl_reg <- lm(bprs ~ week + treatment, data = bprsl)
summary(bprsl_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = bprsl)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

Random intercept model to the rescue! This means that we allow the intercept to vary for each individual male.

bprsl_int <- lmer(bprs ~ week + treatment + (1 | subject), data = bprsl, REML = FALSE)
summary(bprsl_int)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: bprsl
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000
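A small aside (my own addition, not in the course output): the intraclass correlation mentioned earlier can be read straight off these variance components, roughly 47.41 / (47.41 + 104.21) ≈ 0.31, so about a third of the variation in bprs is between subjects. In code:

# ICC from the random intercept model's variance components
vc <- as.data.frame(VarCorr(bprsl_int))
vc$vcov[vc$grp == "subject"] / sum(vc$vcov)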

Random intercept and random slope model

Now both the intercept and the slope over the weeks are allowed to vary for each individual male.

bprsl_intslope <- lmer(bprs ~ week + treatment + (week | subject), data = bprsl, REML = FALSE)
summary(bprsl_intslope)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: bprsl
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8202  8.0511        
##           week         0.9608  0.9802   -0.51
##  Residual             97.4307  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000

Random intercept and random slope model with a week × treatment interaction

Finally, an interaction between time (week) and treatment is added, and the fitted values from this model are stored for the comparison plots below.

bprsl_intslopeX <- lmer(bprs ~ week + treatment + week*treatment + (week | subject), data = bprsl, REML = FALSE)

Fitted <- fitted(bprsl_intslopeX)
bprsl <- bprsl %>% mutate(Fitted)

summary(bprsl_intslopeX)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + week * treatment + (week | subject)
##    Data: bprsl
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0767  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 65.0016  8.0624        
##           week         0.9688  0.9843   -0.51
##  Residual             96.4699  9.8219        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2522  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
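
The three mixed models above were not formally compared, but the usual next step would be likelihood ratio tests with anova(); a sketch (output omitted here):

# Likelihood ratio tests between the nested mixed models
anova(bprsl_int, bprsl_intslope)        # does the random slope improve the fit?
anova(bprsl_intslope, bprsl_intslopeX)  # does the week:treatment interaction help?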

Comparison: Observed vs fitted bprs by groups

Here they are! Lol! Since the subject numbers 1-20 repeat in both treatment groups, a unique subject2 identifier is created first so that each male gets his own line.

# subject numbers 1-20 repeat in both treatment groups, so create a unique ID per male
bprsl <- mutate(bprsl, subject2 = ifelse(treatment == 2, subject + 20, subject))




ggplot(bprsl, aes(x = week, y = bprs, group = subject2)) + 
  geom_line(aes(linetype = treatment)) +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
  scale_y_continuous(name = "BPRS") +
  theme(legend.position = "top")

ggplot(bprsl, aes(x = week, y = Fitted, group = subject2)) +
  geom_line(aes(linetype = treatment)) +  
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
  scale_y_continuous(name = "Fitted BPRS") +
  theme(legend.position = "top")